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Abstract 

Reference-free SNP detection, that is identifying SNPs between samples directly from comparison of primary 
sequencing data with other primary sequencing data and not to a pre-assembled reference genome is an 
emergent and potentially disruptive technology that is beginning to open up new vistas in variant identification 
that reveals new applications in non-model organisms and metagenomics. The modern, effcient data structures 
these tools use enables researchers with a reference sequence to sample many more individuals with lower 
computing storage and processing overhead. In this article we will discuss the technologies and tools 
implementing reference-free SNP detection and the potential impact on studies of genetic variation in model and 
non-model organisms, metagenomics and personal genomics and medicine. 




Genomics 



SNP calling pipelines rely on a reference genome 

It is no overstatement to say that the recent technological 
advances that have made it possible to sample whole gen- 
omes many times over has changed forever the way that 
geneticists and genomicists design and carry out their 
experiments. Many of these experiments require the detec- 
tion of genetic variants as a preliminary, most often single 
nucleotide polymorphisms (SNPs). The basis of all these 
experiments is the same: one sequence must be compared 
with another. The predominant model for this is to have a 
single genome assembly chosen as the baseline against 
which all others will be compared. Often this reference gen- 
ome will have been produced in a large-scale 'big-biology' 
project by a large consortium using long read Sanger-style 
sequencing or a hybrid approach that mixes long and short 
reads, but the defining characteristic will be that a great 
deal of time and expense has gone into preparing the refer- 
ence. To avoid this expense later, the genome is not 
assembled for each further individual sample, instead the 
sequence reads will be aligned directly to the single 
assembled reference irrespective of any assembly errors or 
structure specific to the population or individual sequenced. 
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The consensus from the sequence reads will then be taken 
as the new sample's genome and SNPs are called using 
sophisticated statistical methods based on sequence and 
alignment quality metrics. 

The De Bruijn graph as a data-structure for 
identifying variants 

Recently, a new model that is predominantly based on De 
Bruijn graphs, has emerged that removes the need for a 
reference genome and uses comparisons of the raw 
sequence directly. De Bruijn graphs are directed graphs of 
overlapping symbols that are well suited to representing 
ordered relationships between same length sequences, 
such as sub-sequences of sequence reads (see Figure 1). 
De Bruijn graphs have proven to be of great utility as the 
underlying data model over which almost all de novo 
assembly algorithms designed to use short read data have 
been implemented. Some examples of these tools are 
Velvet [1], ABySS [2], Euler [3], SOAPdenovo [4] and ALL- 
PATHS [5]. In the majority of these assembly tools the De 
Bruijn graph is implemented as an internal data structure 
that represents a network in which the linked entities are k- 
mers from the sub-sequences of sequence reads and links 
are made between /r-mers that overlap by /c-1 with a single 
nucleotide overhang (in some implementations the /c-mer 
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>read_l 

CGATTCTAAGTGTACTGC... 



1. Break the reads into overlapping bits of length k (k-mers) 

2. Make each k-mer a node in the graph 

3. Make links between overlapping kmers 

4. Follow paths 




CGATTCTAAGT 

Figure 1 De Bruin graphs constructed from overlapping k-mers. De Bruijn graphs are networks of short overlapping sub-sequences of reads 
of length k. Typically, k-mers are set as the nodes in the graph and links are drawn between /t-mers that have overlap of length k - 1, that is 
they overhang each other by just one nucleotide. 



is the link, rather than the entity). In a situation with no 
read errors and with k long enough to include the longest 
repeat in a single k-mev then it is theoretically possible to 
reconstruct the genome by following from /c-mer to k-mer 
along the edges, passing through each £-mer only once. 
Since sequencing data do contain errors and genomic 
repeats can be very long (many times longer than fre- 
quently used read length, let alone the length of k used) 
then the read errors can cause dead ends in the path that 
terminate extension of the contig, repeats in the genome 
can cause cycles in the graph. True genetic variation will 
also cause these sub-structures. The different ways in 
which the De Bruijn graph based assembly programs get 
around these problems, essentially how they throw out var- 
iation-like structure in the graph, is what separates them 
and is the area on which assembler research is currently 
focussed. By focussing on these structures, rather than 
throwing them away, we can identify variation directly 
from the graph of £-mers. The basic distinct structure that 
indicates the presence of a SNP is described as a 'bubble' 
(see Figure 2). A bubble is caused by a single nucleotide dif- 
ference at the end of two /c-mers that causes a closing bifur- 
cation within the graph such that each path around the 
bubble indicates a different allele. SNPs can be identified by 
searching the De Bruijn graph for these bubble structures. 
These bubbles may also have long leading or trailing paths 
of nucleotides, meaning SNPs can be embedded in long 
contigs. 

Limitations of short-read alignment based variant 
calling 

With current tools the SNP content of whole genomes of 
many organisms can be compared very widely in just 
hours and the differences between them catalogued almost 
as a matter of course. As such we have vast new horizons 
of genetic variation to explore. The new vistas offered by 
advances in sequencing are not as broad as they might at 
first seem and there are significant disadvantages to the 
usual SNP- by- alignment approach. 



A typical SNP and INDEL calling pipeline based on 
alignment to a reference is restricted by the reference. 
Generally, deletions and only small variations of that cano- 
nical sequence can be found whereas larger structural 
rearrangements, such as inversions or large insertions can- 
not be, without significant further analysis, such as reas- 
sembly of the genome or any non-mapping reads. The 
reads are too small to show these within themselves and 
any mate-pair orientation or read insert size information 
that points towards this will often be rejected since it does 
not fit in with the a priori quality parameters of the align- 
ment pipeline. As a consequence sequence reads from 
such areas are rejected and the regions cannot be discov- 
ered. A reference-free de novo assembly approach can 
discover small changes like these as new or sample depen- 
dent sections in the graph. Reference-free methods are not 
the only approach and a number of techniques incorporat- 
ing reference-based mapping have also been published 
that seek to identify more complex structural variations 
through additional algorithmic layers built on top of 
standard mapping tools such as BWA [6-8]. 

Multiple recent studies are now suggesting that an 
individual's genomes may be much more mosaic-like 
than previously thought [9] and that important variations 
may be a result of larger structural changes rather than 
simply substitutions in the sequence. A reference-free 
model allows us to find such variations as a series of por- 
tions of the graph apparent in only one or a few samples. 
Such approaches also offer greater power to identify rarer 
events, for example in comparisons of parental and child 
genomes and exomes, due to the algorithms giving less 
reliance to simple coverage based metrics. 

Many scientists work with non-model organisms that 
lack an assembled reference genome, so when work that 
aims to identify variation between samples a typical work- 
flow is to build a working reference genome. Completed 
or even relatively good drafts are time consuming and 
expensive to build from next generation sequencing 
technologies and as the genome sequence is only a means 
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Figure 2 Bubble structures formed in De Bruijn graphs by SNPs. Bubble structures form as the result of a divergence in sequence by one 
nucleotide, initially at the end of a /c-mer, that then moves backwards at each progressive node, allowing for a close of the two paths at the 
end. Colouring the edges in the graph according to sample provenance helps identify inter-sample SNPs. 



to an end in these cases then often primary contigs or 
first-pass scaffolds will be used. Ultimately, this can be of 
detriment to the final aim as poor quality and incomplete 
reference means that read alignment can be badly sub- 
optimal and generate many false positives or miss much of 
the variation. There are systematic errors that occur in 
alignment-based SNP detection that rely particularly on 
reference genome quality and the alignment parameters: 
three valid SNPs, equally-spaced in the read, can mean no 
SNPs found at all if the analysis specifies a two mismatch 
cutoff per read. A reference-free method that captures all 
information from the reads can potentially avoid the 
sequence problems of an error-prone early draft genome 
and reduce expense from the time and effort required to 
create one and therefore potentially maximise the power 
of variant calling. 

Metagenomic data provide an even greater challenge 
to the task of SNP calling. Samples typically consist of a 
mixture of genomes at widely varying abundance levels. 
Even with extremely deep sequencing, the genomes of 
some species will be incomplete in the read set. The dif- 
ficulty of estimating read depth, combined with the like- 
lihood that many of the genomes will be from closely 
related species makes SNP calling based on depth extre- 
mely problematic. Thus, reference-free algorithms that 
take account of local graph topology or locally derived 
coverage statistics have the potential to produce results 
where a standard alignment based approach would fail. 

Data storage and archiving is a major problem for 
alignment based methods of SNP calling. The major 
recording method for alignments is the SAM format, a 
record-per-alignment format whose file size increases 



with the number of reads that have been used in the align- 
ment. Although clever compression and indexing to BAM 
format and others [10] alleviates some of the pain, the sto- 
rage of alignment data for many samples consumes large 
amounts of expensive disk-space, takes a long time to 
move across networks for processing or sharing, and 
requires that storage continually expands as new samples 
are added. As these files can be many gigabytes in size for 
individual samples, it is not hard to see how a medical 
centre trying to deal with many hundreds of patients data 
in a clinical diagnostic setting would perhaps struggle to 
keep up with demands on that data. Nor is it hard to see 
how a population genomics study would need to rely on 
massive amounts of storage for its many data sets, thereby 
limiting the sample that could actually be taken and the 
science that could be done. Using a De Bruijn graph to 
hold sequence information can potentially be much more 
compact than the conventional per-read alignment in 
BAM files since each k-mer of the original sequence need 
only be represented once. Per sample counts of occurence 
of that k-mer are recorded and the structure of the graph 
encoded. New A"-mers not seen in previous samples 
increase the size of the files more substantially as they, 
their counts and the changes to the graph structure must 
be recorded, but the overall growth is many times less 
than that taken up by per-alignment storage methods. A 
significant disadvantage of the De Bruijn graph approaches 
developed so far is that individual read information is lost 
in the preparation of the graph. Scientists still like to be 
able to check each read manually when calling SNPs, 
often to convince themselves of the quality of the calls and 
rule out obvious errors in SNP calling in a very ad hoc 
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manner, reads cannot yet be reconstituted from De Bruijn 
graphs as they can from BAM alignments so a dual storage 
is needed. Although the cost of sequencing has stopped 
declining at the rate it has done over the last few years, the 
low cost of the actual sequence is still a major attraction to 
laboratories interested in these sorts of experiments. The 
cost of analysis and storage of data is much more often 
overlooked and constitutes a substantial investment too. 
While it is true that for alignment-based SNP detection it 
is arguable that we need only store BAM files for as long 
as it takes to calculate a result and that final output can be 
stored more efficiently than the reads, such an approach 
precludes the possibility of truly novel discovery that the 
De Bruijn graph approach makes possible and is also not 
amenable to easy re-analysis. 

Although being able to call SNPs and other variants 
without a reference is very useful, when a reference is 
available it can add much to analyses using these sys- 
tems as it can provide a very useable guide set of 
k-mers that SNP finding without losing any of the 
advantages of the reference-free model. 

Strategies for identifying variants from the 
structure of a De Bruijn graph 

Depth-first 

Iqbal et al. produced the first methods for detecting var- 
iants and distinguishing populations or individuals by 
finding alleles with the use of a differential 'colouring' 
technique on the graph [11]. A key component of their 
strategy is to add a nominal colour to the Aimers that 
come from each sample so that uniquely 'coloured' seg- 
ments in the graph then represent regions unique to one 
sample, naturally this approach is well suited to describ- 
ing insertions or deletions as well as SNPs. Variant iden- 
tification from a De Bruijn graph relies on finding the 
structures that correspond to different classes of variant. 
Since the SNPs manifest as bubble structures within the 
graph, one way of finding these structures is to use a 
graph traversal algorithm, such as the clean bubble caller 
employed in Iqbal et al., and the more computationally 
intensive depth-first search method in Leggett et al. [12]. 
These algorithms proceed by marking the nodes in the 
graph that have at least two edges departing from them 
as starting points and following nodes sequentially (graph 
walking) until a node with at least two edges entering it 
is reached. Two such paths from the same origin to the 
same destination, with different colours delineate a 
potential bubble and therefore SNP between samples 
(Figure 2). A complication occurs when the paths split 
again before they can resolve, resulting in an large 
increase in search space and therefore compute time. In 
the case of Iqbal et al., path complexity can be restricted 
with the use of a reference, while the solution for Leggett 
et al. is to limit the number of branching paths that are 



followed with a maximum depth parameter and thereby 
allow the algorithm to complete. A significant source of 
false-positives for this method comes from the fact that 
dual out-edges can be created by genome structure that 
is not just a SNP- any pair of k-mers that differ by a sin- 
gle end nucleotide causes the key bifurcation thus 
repeats, inversions and other low-complexity genome 
sequence can be problematic to differentiate from the 
desired SNPs. 

This approach is not just restricted to identifying 
SNPs. Iqbal et al also showed the power of the approach 
for finding INDELs and larger scale variants between 
samples, though power of the algorithm is increased for 
complex INDELs by using a reference. 

Topological 

Another graph structure, or topological method that has 
appeared recently has taken a very strict approach to the 
definition of the bubble structures and therefore repre- 
sents a potentially very effective SNP detection tool. In the 
2k + 2 approach, bubbles are identified by decomposing 
the De Bruijn graph to an undirected graph and then iden- 
tifying bubbles as Eulerian and Hamiltonian cycles, i.e 
paths in bubble sized subgraphs {2k) that visit each node 
and edge just once on their way back to the starting node. 
This method's very precise description of a bubble means 
that it has the potential to be extremely specific and gener- 
ate highly accurate lists of SNPs from graph structure 
alone (Younsi et al, unpublished) (2k + 2), however the 
computational time required is great and sensitivity must 
be compromised, the entire graph cannot be searched and 
random start sites within the graph must be selected 
repeatedly to prevent ample coverage, furthermore as 
graph complexity increases the time to complete increases 
badly too, rendering these approaches likely unsuitable for 
metagenomics applications or situations were complete- 
ness of SNP detection, rather than simpler marker selec- 
tion is needed. 

Microassembly 

Peterlongo et al. [13] adopt a microassembly approach to 
SNP detection. They begin by producing a tree of k-mers 
for an input read set and then build bubble microassem- 
blies (or 'mouth' structures in their terminology) based 
on the k-mer set. The algorithm consists of first picking a 
seed k-mer by choosing the k-mer with the lowest count 
that occurs more in one set than another. A low count 
k-mer is chosen to avoid repeat structures, while it his 
hoped that a differing set count may be indicative of a 
SNP. Having chosen a seed, the algorithm assumes that it 
lies on one path through a SNP and then looks for an 
opposite k-mer, one substitution different, which would 
lie on another path through the bubble. If this can be 
found in the k-mer tree, then a recursive algorithm builds 
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paths left and right of each k-mer until they join up or no 
k-mer can be found in the tree. Once a bubble has been 
identified, it is checked for read coherence by checking 
for coverage from at least two reads. The approach has a 
significant memory saving and speed advantage over 
techniques that build a complete De Bruijn graph, as the 
initial tree structure does not include connectivity infor- 
mation. However, the method does not handle heterozy- 
gous SNPs and is limited to pairwise comparisons 
for data sets containing more than two samples. An 
improved version of the technique is currently being pre- 
pared for publication that addresses some of these limita- 
tions ([14] and R. Uricaru, pers. comm.). 

Maximum likelihood methods 

The Stacks tool developed by Catchen et al. [15] is 
designed for restriction enzyme based sequencing proto- 
cols such as RAD-Seq (Restriction site Associated 
DNA), a technique designed to sample from the same 
reduced and representational fraction of the genomes of 
multiple individuals of a population [16,17]. The method 
begins by generating 'stacks' of reads at particular loci, 
which form because of the nature of the restriction 
enzymes used. These stacks are then merged with 
nearby stacks using either a de novo or a reference 
based approach. SNPs are called from stacks using a sin- 
gle-nucleotide, diploid maximum liklihood genotyping 
algorithm designed for RAD-Seq data by Hohenlohe et 
al. [18]. Related work by Dou et al. take this ML 
method further, publishing what is described as an 
improved maximum liklihood algorithm (iML) [19]. 
They note that in a Stacks experiment, short reads from 
repetetive regions cluster together and are difficult to 
distinguish, potentially resulting in false SNP calling. 
They observe that the distribution of composite clusters 
should show a repeating pattern corresponding to copy 
number variations of the repeating elements. This is 
modelled with a Poisson distribution and the model 
combined with the Hohenlohe et al. ML algorithm to 
exclude repetetive loci from SNP calling. 

Contig based graphs 

Approaches specifically targetted at metagenomic data 
have been rare, due to the complexity of such datasets. 
However Nijkamp et al. have recently published a tool 
called MARYGOLD which aims to identify complex bub- 
bles in and between metagenomic samples [20]. Their 
approach is based on contig graphs rather than De Bruijn 
graphs. In such graphs, the nodes represent contigs and 
the edges represent reads that span connected contigs. In 
metagenomic datasets, multiple edges indicate sequence 
divergence of the same or related species and, as with De 
Bruijn graphs, these edges typically reconverge forming 
bubble structures. The MARYGOLD approach is to 



identify these regions and collapse them to form contigs 
representing multiple alleles. The contigs are output in a 
graph structure that represents the multiple alleles and 
output in Circos format is also supported. 

Implementations 

There are already a number of tools already available for 
reference-free SNP calling (see Table 1). The most sig- 
nificant tool yet implemented for reference free analysis 
is Cortex - in which Iqbal et al implemented fast and 
highly memory efficient data structures and algorithms 
for general variant detection, and is the first implemen- 
tation that makes use of explicitly coloured De Bruijn 
graphs and incorporates a de novo assembly algorithm. 
Given a specific genome of interest, a variant-size and a 
k-mer size, the power to detect variants was modelled, 
interestingly the authors noted that without a reference, 
it was difficult to distinguish read errors and repeats 
from genuine SNPs. An extension module to Cortex 
named Bubbleparse handles this problem explicitly by 
incorporating extra information on genetic background 
and applying a maximum likelihood heuristic to rank 
candidate bubbles and thereby increase accuracy of SNP 
calling so that false-postive calls can be reduced by five- 
fold relative to the base algorithm, moving from 50 - 60 
% false positive unranked to5 - 10 % after ranking. Bub- 
bleparse can also select and prioritise SNPs that satisfy 
prior information about genetic background, meaning 
that it is particularly suited to the identification of SNPs 
de novo from specific genetic crosses. 

Pipelines have also been developed for the identifica- 
tion of polymorphisms de novo and for tracking of 
genetic loci from populations sequenced using the RAD 
tag approach, a reduced representation sequencing pro- 
cess, Stacks [15] is a comprehensive toolkit that uses a 
non De Bruijn graph method and instead uses sample 
categorisation (equivalent to colours in a De Bruijn 
graph) to group reads from the tags that are decom- 
posed into a k-mer list, or stack. A similarity graph 
between stacks is created to reduce their number and to 
merge stacks to give the potential SNP loci. A maximum 

Table 1 de novo reference-free analysis software and 



availability. 



Cortex 


http://cortexassembler.sourceforge.net/ 


Bubbleparse 


https://github.com/richardmleggett/ 




bubbleparse 


Bubbleparse 


https://github.com/danmaclean/Bubble-Parse 


accessories 




2k +2 


http://sourceforge.net/projects/twokplustwo/ 


NIKS 


http://sourceforge.net/projects/niks/ 


discoSnp 


http://colibread.inria.fr/discosnp/ 


Stacks 


http://creskolab.uoregon.edu/stacks/ 


MaryGold 


http://sourceforge.net/projects/metavar/ 



Leggett and MacLean BMC Genomics 2014, 15(Suppl 4):S10 
http://www.biomedcentral.com/1471-2164/15/S4/S10 



Page 6 of 7 



likelihood consensus is used to infer the SNPs and 
genetic cross data is incorporated to construct haplo- 
types and create comprehensive, dense genetic maps. 

The more recent NIKS, needle in a k stack, tool [21] is 
a related but much more directed procedure that does 
not aim to call a wide range of SNPs or produce a map. 
Rather it aims to narrow down the SNPs that corre- 
spond to mutations induced after a mutant screen and 
sequencing from simple genetic back crosses, especially 
those performed for purifying a mutation after EMS 
mutagenesis of the model plant Arabidopsis thaliana. 
Non-candidate SNPs are discarded by the method. 

The 'mouth work of Peterlongo et al resulted in the tool 
kisSnp, which evolved into kisSnp2 and is now part of dis- 
coSnp [14,22]. discoSnp combines the SNP predictions 
from kisSnp2 with a second module which computes 
mean coverage and average quality per read set of the 
reads generating the SNP. 

Impact of reference-free SNP calling on the 
biomedical sciences 

The impact of reference-free SNP calling promises to be 
significant. Genetic studies in non-model organisms will be 
facilitated greatiy - with these methods it is now possible to 
start assaying genetic variability across many samples with- 
out the time and expense of building a reference genome. 
This alone brings the technology to smaller groups with 
fewer resources but also allows new questions to be asked 
across wider phylogenetic groupings than is possible. 
Although the reference free methods are just as sensitive to 
nucleotide distance between samples and large genetic dis- 
tances cannot be assayed any more effectively than with 
alignment based methods, virtually any organism can be 
compared against close relatives with just the sequencing 
and much lower computational overhead. Since many of 
the reference-free tools are based on the same De Bruijn 
graph structure as de novo short read assemblers the SNP 
calls generated can be embedded in a long species-specific 
contig, an attribute that promises to have great use in the 
field of plant breeding. Domesticated crop species typically 
have very large, highly repetitive genomes that have been 
refractory to attempts to generate completed genome 
sequences. A key aim in plant breeding is to identify the 
genetic variation that contributes to agronomically impor- 
tant traits so that it can be selected effectively and intro- 
duced into commercially important crop lines. Closely 
related wild relatives of crop species, a great example is 
wheat and related wild grasses, are reservoirs of genes that 
can confer resistance to pathogens. By using reference-free 
SNP approaches plant breeders can access a much wider 
pool of resistance related genes since the alleles that associ- 
ate with resistance in some individuals or populations can 
be assayed much more quickly. The ability to generate con- 
tigs with trait linked SNPs embedded means standard tools 



like BLAST can be used to identify the candidate function 
of the gene from which that contig arose, even when no 
specific genetic resources are available. 

Application of reference-free SNP detection methods 
and data structures could help to resolve the data sto- 
rage and access issues associated with alignment-based 
methods meaning that true genome wide SNP and var- 
iant searches can be more easily commoditised for the 
medical and clinical markets and therefore be a crucial 
element in the deployment of personal genomics based 
medicine. Plummeting sequence prices are starting to 
make individual personal genome sequencing a diagnos- 
tic reality. The small data overhead incurred for each 
new sample sequenced means much more information 
can be stored for similar cost which brings new diagnos- 
tic and research possibilities much closer. Being able to 
work with sequence regions not present in the available 
reference genomes and the SNPs in those regions means 
that new associations between disease and genetic varia- 
tions can be collected, and the reference free framework 
is fit to make this research possible directly from the 
clinic. Variant detection absolutely de novo like this 
means that the complex mutations that make up the 
genomic pathologies of cancers, for example, can be 
uncovered much more easily for each patient and case 
and the changes cross-referenced with the patients own 
individual reference. Furthermore the recent realisations 
that our genomes vary even within us, perhaps from tis- 
sue to tissue is much more tractable in the reference- 
free framework. 

We are learning that individual organisms genomes 
can be incredibly diverse and capturing that diversity 
will become vital in future research. The reference-free 
SNP detection methods are among the first steps in a 
new approach that will help relieve us from the limiting 
prior information in the reference genome. 
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